Doubly-Attentive Decoder for Multi-modal Neural Machine Translation
We introduce a Multi-modal Neural Machine Translation model in which a
doubly-attentive decoder naturally incorporates spatial visual features
obtained using pre-trained convolutional neural networks, bridging the gap
between image description and translation. Our decoder learns to attend to
source-language words and parts of an image independently by means of two
separate attention mechanisms as it generates words in the target language. We
find that our model can efficiently exploit not just back-translated in-domain
multi-modal data but also large general-domain text-only MT corpora. We also
report state-of-the-art results on the Multi30k data set. Comment: 8 pages (11 including references), 2 figures.
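To make the doubly-attentive decoder concrete, here is a minimal PyTorch sketch of one decoder step with two independent additive attention mechanisms, one over source-word annotations and one over spatial image features. The dimensions, the GRU cell, and the fusion by concatenation are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: one decoder step attending to source words and image
# regions with two separate additive attention mechanisms.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, query_dim, key_dim, attn_dim):
        super().__init__()
        self.q = nn.Linear(query_dim, attn_dim)
        self.k = nn.Linear(key_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, query, keys):
        # query: (batch, query_dim); keys: (batch, n, key_dim)
        scores = self.v(torch.tanh(self.q(query).unsqueeze(1) + self.k(keys)))
        weights = torch.softmax(scores, dim=1)           # (batch, n, 1)
        return (weights * keys).sum(dim=1)               # (batch, key_dim)


class DoublyAttentiveStep(nn.Module):
    def __init__(self, hid=512, src_dim=512, img_dim=2048, emb=256):
        super().__init__()
        self.txt_attn = AdditiveAttention(hid, src_dim, 256)
        self.img_attn = AdditiveAttention(hid, img_dim, 256)
        self.rnn = nn.GRUCell(emb + src_dim + img_dim, hid)

    def forward(self, prev_emb, hidden, src_annotations, img_regions):
        c_txt = self.txt_attn(hidden, src_annotations)   # source-word context
        c_img = self.img_attn(hidden, img_regions)       # image-region context
        return self.rnn(torch.cat([prev_emb, c_txt, c_img], dim=-1), hidden)


step = DoublyAttentiveStep()
h = step(torch.randn(2, 256), torch.randn(2, 512),
         torch.randn(2, 20, 512), torch.randn(2, 49, 2048))  # new hidden state
```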
Latent Variable Model for Multi-modal Translation
In this work, we propose to model the interaction between visual and textual
features for multi-modal neural machine translation (MMT) through a latent
variable model. This latent variable can be seen as a multi-modal stochastic
embedding of an image and its description in a foreign language. It is used in
a target-language decoder and also to predict image features. Importantly, our
model formulation utilises visual and textual inputs during training but does
not require that images be available at test time. We show that our latent
variable MMT formulation improves considerably over strong baselines, including
a multi-task learning approach (Elliott and Kádár, 2017) and a conditional
variational auto-encoder approach (Toyama et al., 2016). Finally, we show
improvements due to (i) predicting image features rather than only conditioning on them, (ii) imposing a constraint on the minimum amount of information encoded in the latent variable, and (iii) training on additional target-language image descriptions (i.e. synthetic data). Comment: Paper accepted at ACL 2019. Contains 8 pages (11 including references, 13 including appendix), 6 figures.
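Point (ii), the constraint on the minimum amount of information encoded in the latent variable, is often implemented as a "free bits" floor on the KL term of the ELBO. The sketch below shows that mechanism for a Gaussian posterior with a standard-normal prior; the floor value and the shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a free-bits KL floor for a Gaussian latent variable.
import torch


def reparameterize(mu, logvar):
    # z = mu + sigma * eps, the usual reparameterisation trick
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)


def free_bits_kl(mu, logvar, floor=0.5):
    # KL(q(z|x) || N(0, I)) per latent dimension, averaged over the batch
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)  # (batch, z_dim)
    kl = kl.mean(dim=0)                                   # (z_dim,)
    # Floor each dimension so the model cannot collapse the latent to the prior.
    return torch.clamp(kl, min=floor).sum()


mu, logvar = torch.zeros(8, 32), torch.zeros(8, 32)       # from an encoder
z = reparameterize(mu, logvar)        # the multi-modal stochastic embedding
kl_term = free_bits_kl(mu, logvar)    # enters the training objective
```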
Incorporating visual information into neural machine translation
In this work, we study different ways to enrich Machine Translation (MT) models using information obtained from images. Specifically, we propose different models to incorporate images into MT by transfer learning from pre-trained convolutional neural networks (CNNs) trained to classify images. We use these pre-trained CNNs for image feature extraction, and use two different types of visual features: global visual features, which encode an entire image into one single real-valued feature vector; and local visual features, which encode different areas of an image into separate real-valued vectors, therefore also encoding spatial information. We first study how to train embeddings that are both multilingual and multi-modal, using global visual features and multilingual sentences for training. Second, we propose different models to incorporate global visual features into state-of-the-art Neural Machine Translation (NMT): (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. Finally, we put forward one model to incorporate local visual features into NMT: an NMT model with an independent visual attention mechanism integrated into the same decoder Recurrent Neural Network (RNN) as the source-language attention mechanism. We evaluate our models on the Multi30k, a publicly available, general-domain data set, and also on a proprietary data set of product listings and images built by eBay Inc., which was made available for the purpose of this research. We report state-of-the-art results on the publicly available Multi30k data set. Our best models also significantly improve on comparable phrase-based Statistical MT (PBSMT) models trained on the same data set, according to widely adopted MT metrics.
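Strategy (iii) above, using global visual features as additional data to initialise the decoder hidden state, reduces to a learned projection from the image feature space into the decoder state space. A minimal sketch, assuming a 2048-dimensional CNN feature vector and a tanh projection (both illustrative choices):

```python
# Hedged sketch: initialising a decoder hidden state from a global image feature.
import torch
import torch.nn as nn


class ImageInitialisedDecoderState(nn.Module):
    def __init__(self, img_dim=2048, hid_dim=512):
        super().__init__()
        self.proj = nn.Linear(img_dim, hid_dim)

    def forward(self, global_img_feat):
        # global_img_feat: (batch, img_dim), e.g. a pooled CNN feature vector
        return torch.tanh(self.proj(global_img_feat))


init = ImageInitialisedDecoderState()
h0 = init(torch.randn(4, 2048))   # (4, 512), used as the decoder's h_0
```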
Developing a dataset for evaluating approaches for document expansion with images
Motivated by the adage that a “picture is worth a thousand words”, it can be reasoned that automatically enriching the textual content of a document with relevant images can increase its readability. Moreover, features extracted from the additional image data inserted into the textual content of a document may, in principle, also be used by a retrieval engine to better match the topic of a document with that of a given query. In this paper, we describe our approach to building a ground-truth dataset to enable further research into the automatic addition of relevant images to text documents. The dataset comprises the official ImageCLEF 2010 collection (a collection of images with textual metadata) to serve as the images available for automatic enrichment of text, a set of 25 benchmark documents that are to be enriched, which in this case are children's short stories, and a set of manually judged relevant images for each query story obtained by the standard procedure of depth pooling. We use this benchmark dataset to evaluate the effectiveness of standard information retrieval methods as simple baselines for this task. The results indicate that using the whole story as a weighted query, where the weight of each query term is its tf-idf value, achieves, on average, a precision of 0.1714 within the top 5 retrieved images.
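The tf-idf baseline can be reproduced with standard tooling: index the images' textual metadata, use the whole story as a query whose term weights are tf-idf values, and rank by cosine similarity. The snippet below is one plausible reading of that setup with placeholder data, not the authors' exact pipeline.

```python
# Hedged sketch of the tf-idf weighted-query baseline over image metadata.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

image_metadata = [                       # placeholder metadata records
    "a brown fox jumping over a wooden fence",
    "children playing football in a park",
    "an old castle on a hill at sunset",
]
story = "The little fox ran through the park, jumping and playing all day."

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(image_metadata)  # index the collection
query_vector = vectorizer.transform([story])            # tf-idf weighted query

scores = cosine_similarity(query_vector, doc_vectors).ravel()
top5 = scores.argsort()[::-1][:5]   # precision@5 is judged over this ranking
print(top5, scores[top5])
```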
DCU System Report on the WMT 2017 Multi-modal Machine Translation Task
We report experiments with multi-modal neural machine translation models that incorporate global visual features in different parts of the encoder and decoder, and use the VGG19 network to extract features for all images. In our experiments, we explore both different strategies to include global image features and how ensembling different models at inference time impacts translations. Our submissions ranked 3rd best for translating from English into French, always improving considerably over a neural machine translation baseline across all language pairs evaluated, e.g. an increase of 7.0–9.2 METEOR points.
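Ensembling NMT models at inference time is typically done by averaging the models' next-word probability distributions at each decoding step. A minimal sketch of that operation in isolation, assuming a uniform average (the submissions may weight models differently):

```python
# Hedged sketch: averaging next-token distributions from an ensemble.
import numpy as np


def ensemble_next_token(prob_dists):
    # prob_dists: list of (vocab_size,) arrays, one per model, each summing to 1
    avg = np.mean(np.stack(prob_dists), axis=0)
    return int(np.argmax(avg)), avg


p1 = np.array([0.1, 0.6, 0.3])
p2 = np.array([0.2, 0.5, 0.3])
token, avg = ensemble_next_token([p1, p2])  # token 1, avg [0.15, 0.55, 0.3]
```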
Are scene graphs good enough to improve Image Captioning?
Many top-performing image captioning models rely solely on object features
computed with an object detection model to generate image descriptions.
However, recent studies propose to directly use scene graphs to introduce
information about object relations into captioning, hoping to better describe
interactions between objects. In this work, we thoroughly investigate the use
of scene graphs in image captioning. We empirically study whether using
additional scene graph encoders can lead to better image descriptions and
propose a conditional graph attention network (C-GAT), where the image
captioning decoder state is used to condition the graph updates. Finally, we
determine to what extent noise in the predicted scene graphs influences caption
quality. Overall, we find no significant difference between models that use
scene graph features and models that only use object detection features across
different captioning metrics, which suggests that existing scene graph
generation models are still too noisy to be useful in image captioning.
Moreover, although the quality of predicted scene graphs is very low in
general, when using high quality scene graphs we obtain gains of up to 3.3
CIDEr compared to a strong Bottom-Up Top-Down baseline. Comment: 11 pages, 3 figures. Accepted for publication in AACL-IJCNLP 2020.
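The conditioning idea behind C-GAT, letting the captioning decoder state steer the graph update, can be sketched as graph attention whose edge scores depend on the decoder state as well as the node pair. The single-head additive scoring, dimensions, and ReLU update below are illustrative assumptions, not the exact C-GAT formulation.

```python
# Hedged sketch of graph attention conditioned on a decoder state.
import torch
import torch.nn as nn


class ConditionalGraphAttention(nn.Module):
    def __init__(self, node_dim=512, dec_dim=512, attn_dim=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * node_dim + dec_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.update = nn.Linear(node_dim, node_dim)

    def forward(self, nodes, adj, dec_state):
        # nodes: (n, node_dim); adj: (n, n) 0/1 mask including self-loops;
        # dec_state: (dec_dim,) current captioning decoder state
        n = nodes.size(0)
        h_i = nodes.unsqueeze(1).expand(n, n, -1)
        h_j = nodes.unsqueeze(0).expand(n, n, -1)
        d = dec_state.expand(n, n, -1)
        e = self.score(torch.cat([h_i, h_j, d], dim=-1)).squeeze(-1)
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                 # per-node edge weights
        return torch.relu(alpha @ self.update(nodes))    # updated node features


cgat = ConditionalGraphAttention()
out = cgat(torch.randn(5, 512), torch.eye(5), torch.randn(512))  # (5, 512)
```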
Soft-prompt tuning to predict lung cancer using primary care free-text Dutch medical notes
We investigate different natural language processing (NLP) approaches based
on contextualised word representations for the problem of early prediction of
lung cancer using free-text patient medical notes of Dutch primary care
physicians. Because lung cancer has a low prevalence in primary care, we also
address the problem of classification under highly imbalanced classes.
Specifically, we use large Transformer-based pretrained language models (PLMs)
and investigate: 1) how soft prompt-tuning -- an NLP technique used to
adapt PLMs using small amounts of training data -- compares to standard model
fine-tuning; 2) whether simpler static word embedding models (WEMs) can be more
robust compared to PLMs in highly imbalanced settings; and 3) how models fare
when trained on notes from a small number of patients. We find that 1)
soft-prompt tuning is an efficient alternative to standard model fine-tuning;
2) PLMs show better discrimination but worse calibration compared to simpler
static word embedding models as the classification problem becomes more
imbalanced; and 3) results when training models on a small number of patients are
mixed and show no clear differences between PLMs and WEMs. All our code is
available open source at https://bitbucket.org/aumc-kik/prompt_tuning_cancer_prediction/. Comment: A short version of this paper has been published at the 21st International Conference on Artificial Intelligence in Medicine (AIME 2023).
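Soft prompt-tuning prepends a small matrix of trainable "virtual token" embeddings to the embedded input while every PLM weight stays frozen, so only the prompt (and here a small classification head) is updated. Below is a self-contained sketch with a stand-in encoder; the prompt length, initialisation scale, and mean-pooled head are assumptions.

```python
# Hedged sketch of soft prompt-tuning with a frozen (stand-in) Transformer.
import torch
import torch.nn as nn


class SoftPromptClassifier(nn.Module):
    def __init__(self, embed, encoder, emb_dim=768, prompt_len=20, n_classes=2):
        super().__init__()
        self.embed, self.encoder = embed, encoder
        for p in list(embed.parameters()) + list(encoder.parameters()):
            p.requires_grad = False                      # PLM stays frozen
        self.prompt = nn.Parameter(torch.randn(prompt_len, emb_dim) * 0.02)
        self.head = nn.Linear(emb_dim, n_classes)        # small trainable head

    def forward(self, input_ids):
        tok = self.embed(input_ids)                      # (B, T, D)
        prm = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        hidden = self.encoder(torch.cat([prm, tok], dim=1))  # (B, P+T, D)
        return self.head(hidden.mean(dim=1))             # sequence-level logits


layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
model = SoftPromptClassifier(nn.Embedding(30522, 768),
                             nn.TransformerEncoder(layer, num_layers=2))
logits = model(torch.randint(0, 30522, (4, 16)))         # (4, 2)
```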
Detecting Euphemisms with Literal Descriptions and Visual Imagery
This paper describes our two-stage system for the Euphemism Detection shared
task hosted by the 3rd Workshop on Figurative Language Processing in
conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive
or unpleasant issues like addiction and death. The ambiguous nature of
euphemistic words or expressions makes it challenging to detect their actual
meaning within a context. In the first stage, we seek to mitigate this
ambiguity by incorporating literal descriptions into input text prompts to our
baseline model. It turns out that this kind of direct supervision yields
remarkable performance improvement. In the second stage, we integrate visual
supervision into our system using visual imageries, two sets of images
generated by a text-to-image model by taking terms and descriptions as input.
Our experiments demonstrate that visual supervision also gives a statistically
significant performance boost. Our system achieved second place with an F1 score of 87.2%, only about 0.9% below the best submission. Comment: 7 pages, 1 table, 1 figure. Accepted to the 3rd Workshop on Figurative Language Processing at EMNLP 2022.
https://github.com/ilkerkesen/euphemis
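The first stage, injecting literal descriptions into the input, amounts to simple prompt construction before classification. A minimal sketch in which the template and the description lookup are hypothetical, not the system's exact format:

```python
# Hedged sketch of stage one: augmenting input text with a literal description.
def build_prompt(sentence: str, term: str, descriptions: dict) -> str:
    literal = descriptions.get(term, "")
    return f"{sentence} [SEP] {term} means: {literal}"


descriptions = {"passed away": "died; stopped living"}   # hypothetical lexicon
print(build_prompt("Her grandfather passed away last spring.",
                   "passed away", descriptions))
```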
Towards the generation of a database for scientific research in natural language processing with an information extraction system
Master's dissertation, Natural Language Processing and Language Industries, Faculdade de Ciências Humanas e Sociais, Univ. do Algarve, 2013. The Internet is a network of computers whose origins date back to the 1960s. This network of networks known as the Internet slowly began to drive organisations and people towards interconnectedness and set the foundations for constant growth in the quantity of information available, an unprecedented situation until then. In parallel with this development, the field of Natural Language Processing (NLP) was receiving a fair amount of funding, especially for translation efforts for the Russian-English language pair.
Information Extraction (IE) is an NLP field concerned with the automatic extraction of information from text written in natural language.